In this report I highlight the firms our supervisors should focus on, using key metrics that I explain as they arise. I have kept all of my code in the report for audit and reusability purposes, as conclusions may change when new data arrives. I focus on the three main characteristics stated: Firm Size, Changing Business Profile and Outliers.

Data Loading and Wrangling

I have used a combination of the openxlsx and tidyverse packages to read the xlsx file into an R-friendly format. Below I have produced a summary of all the variables.

The firm, year and value_type columns are character type (length 9,120); the numeric variables summarise as:

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---:|---:|---:|---:|---:|---:|---:|
| eof_for_scr_m | -162.25 | 5.98 | 32.09 | 455.68 | 165.65 | 41636.30 | 2620 |
| equity_m | -123.610 | 5.798 | 34.276 | 480.865 | 176.937 | 26705.042 | 2620 |
| gwp_m | -19.78 | 0.00 | 15.47 | 853.94 | 198.96 | 74078.64 | 2620 |
| gross_bel_m | -167.82 | 0.00 | 10.48 | 259.07 | 159.34 | 19292.07 | 0 |
| gross_claims_incurred_m | -189.936 | 0.000 | 3.392 | 116.696 | 79.463 | 6844.014 | 0 |
| gross_combined_ratio | -14064 | 0 | 0 | 1859 | 1 | 3978266 | 0 |
| gross_expense_ratio | -16112.4 | 0.0 | 0.1 | 273.7 | 0.3 | 575064.8 | 0 |
| nwp_m | -17754.10 | 0.00 | 7.95 | 689.12 | 125.95 | 75526.67 | 2620 |
| net_bel_m | -182.382 | 0.000 | 4.502 | 164.284 | 93.771 | 11351.609 | 0 |
| net_combined_ratio | -5468706 | 0 | 0 | 172 | 1 | 2445529 | 0 |
| net_expense_ratio | -4495420 | 0 | 0 | 1370 | 0 | 2691212 | 0 |
| pure_gross_claims_ratio | -24817.8 | 0.0 | 0.2 | 463.0 | 0.5 | 985011.3 | 0 |
| pure_net_claims_ratio | -33721.0 | 0.0 | 0.1 | 388.1 | 0.5 | 828809.4 | 0 |
| scr_m | 0.000 | 1.794 | 14.533 | 308.437 | 95.191 | 22788.359 | 2620 |
| scr_coverage_ratio | -3 | 1 | 2 | 1301158 | 3 | 999302877 | 2620 |
| total_assets_m | -207.4 | 14.4 | 113.0 | 6333.0 | 955.4 | 553549.9 | 2620 |
| total_liabilities_m | -1487.1 | 3.8 | 63.0 | 6363.4 | 655.7 | 494498.5 | 2620 |

From this summary we can see that each column is of the correct data type and that there is significant variance across the numeric variables. The dataset now looks like this:

| firm | year | value_type | eof_for_scr_m | equity_m | gwp_m | gross_bel_m | gross_claims_incurred_m | gross_combined_ratio | gross_expense_ratio | nwp_m | net_bel_m | net_combined_ratio | net_expense_ratio | pure_gross_claims_ratio | pure_net_claims_ratio | scr_m | scr_coverage_ratio | total_assets_m | total_liabilities_m |
|---|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Firm 1 | 2016 | mean_value | 484.0712 | 249.6647 | 9.353606 | 0.1177988 | 0.0011030 | 0.1350101 | 15.45025 | -3550.820 | 1.978919 | 14.25977 | 8.202612 | 3.506938 | 3.436862 | 404.2644 | 11177400.6440186 | 12737.29 | 8034.337 |
| Firm 1 | 2016 | median_value | 0.0000 | 0.0000 | 0.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | 0.000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.0000000 | 0.00 | 0.000 |
| Firm 1 | 2016 | sd_value | 1080.2952 | 555.8116 | 20.915299 | 0.2634062 | 0.0024665 | 0.3018918 | 34.54781 | 7939.875 | 4.424999 | 31.88581 | 18.341598 | 7.841752 | 7.685056 | 903.9626 | 24993427.5609242 | 28476.09 | 17965.323 |
| Firm 1 | 2016 | time_series_value | 2416.5598 | 1243.9250 | 46.768030 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | -17754.100 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 2021.3219 | 0.1625197 | 63676.87 | 40171.683 |
| Firm 1 | 2017 | mean_value | 484.0712 | 249.6647 | 9.353606 | 0.1177988 | 0.0011030 | 0.1350101 | 15.45025 | -3550.820 | 1.978919 | 14.25977 | 8.202612 | 3.506938 | 3.436862 | 404.2644 | 11177400.6440186 | 12737.29 | 8034.337 |

I have set up the data table so that only the value_type time_series_value rows vary over time; all the other value_type rows stay constant across years. This makes it straightforward to find the top/bottom n firms, or the most/least varying firms, for any variable, as you will see in my visualisation code.
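As a minimal sketch of that wrangling pattern (the firm names and values below are toy data, not from the returns), finding the top-n firms by the median of a metric might look like:

```r
# Toy data in the report's long layout (invented values): one row per
# firm / year / value_type combination
toy <- data.frame(
  firm       = rep(c("Firm A", "Firm B", "Firm C"), each = 2),
  year       = rep(c(2016, 2017), times = 3),
  value_type = "time_series_value",
  gwp_m      = c(10, 12, 50, 55, 3, 4)
)

# Keep only the time-varying rows, then rank firms by the median of a metric
top_n_firms <- function(df, metric, n) {
  ts  <- df[df$value_type == "time_series_value", ]
  med <- tapply(ts[[metric]], ts$firm, median, na.rm = TRUE)
  names(sort(med, decreasing = TRUE))[seq_len(min(n, length(med)))]
}

top_n_firms(toy, "gwp_m", 2)
# → c("Firm B", "Firm A")
```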

Caveats

pull_unique_firms = function(wb, sheet_n){
  
  # read the nth sheet of the workbook and return the firm identifier
  # column ('X1') as a vector
  openxlsx::readWorkbook(
      xlsxFile = wb, 
      sheet = openxlsx::sheets(wb)[sheet_n])['X1'] %>%
    dplyr::pull()
    
}

firms_sheet1_vec = pull_unique_firms(wb, 1)
firms_sheet2_vec = pull_unique_firms(wb, 2)

# how many firms appear on sheet 2 but not on sheet 1?
length(setdiff(firms_sheet2_vec, firms_sheet1_vec))

There are some caveats worth considering with my data-centric approach to resource allocation:

Correlation Plot

Firm Size

To consider firm size I will look at:

As we want to consider both the magnitude and the consistency of these variables, I will use the median score to select the highest-rated firms and plot them against the rest.
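The top-20% selection can be sketched as follows; the firm medians here are invented for illustration:

```r
# Illustrative firm medians (made up); keep firms at or above the 80th percentile
medians <- c(`Firm 1` = 5, `Firm 2` = 120, `Firm 3` = 30, `Firm 4` = 800, `Firm 5` = 15)
cutoff  <- quantile(medians, 0.8)   # threshold marking the top 20% of firms
top_firms <- names(medians)[medians >= cutoff]
top_firms
# → "Firm 4"
```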

From this graph we can see five firms that are distinctly larger than the rest, even though I have only considered the top 20% of firms by median value. It is worth highlighting, and I will explore later, the volatility of some firms' accounts: firms 311 and 210 both show large drops in 2020 (YE).

Only Firms 210 and 4 are considerably larger than the rest of the top 20% of firms (for this variable).

Firm 4 has now appeared twice: first for its significantly larger Net Written Premium and now for Equity. This suggests it is a large, profitable insurance firm with a substantial equity holding, placing it in a very healthy financial position. I would suggest this firm is an important player in the market and one to monitor.

| firm | size_score | nwp_m_abs_sd_normalised | net_bel_m_abs_sd_normalised | pure_gross_claims_ratio_abs_sd_normalised |
|---|---:|---:|---:|---:|
| Firm 210 | 29.276608 | 87.153007 | 0.6719286 | 0.0048892 |
| Firm 105 | 20.448670 | 51.795731 | 9.5449170 | 0.0053618 |
| Firm 4 | 19.790848 | 59.012878 | 0.3564204 | 0.0032472 |
| Firm 311 | 11.391854 | 33.933678 | 0.2394160 | 0.0024683 |
| Firm 26 | 10.151985 | 29.369312 | 1.0823845 | 0.0042572 |
| Firm 199 | 8.642304 | 25.674708 | 0.2488247 | 0.0033785 |
| Firm 151 | 6.559077 | 19.274517 | 0.4007712 | 0.0019427 |
| Firm 25 | 5.705377 | 16.702546 | 0.4107381 | 0.0028466 |
| Firm 73 | 5.326167 | 15.714369 | 0.2590468 | 0.0050841 |
| Firm 52 | 5.125987 | 1.249713 | 14.1235304 | 0.0047166 |

The table above provides a potential top-10 list, using only firms that also reported on the general template. I calculated it by taking the mean normalised absolute standard deviation of the median value for each firm on specific metrics. To explain: I first take the median value of each metric for each firm (the median, because the mean is more affected by outliers) and then find its absolute standard deviation against the population for that metric. This identifies the firms that are, on average, far above the population on each metric. As the distributions vary across metrics, I then normalise each of these scores so that taking the mean introduces no weighting bias. I chose Net Written Premium, Net BEL (inc. TPs as a whole, pre-TMTP) and Pure Gross Claims Ratio because, as my correlation matrix above shows, these three variables explain much of the variance of the entire dataset; I therefore felt they were wide-ranging and mutually exclusive enough to consider all aspects of the business. As the outputs show, there is an issue of possible outliers that I will consider and explore later.
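To make the scoring concrete, here is a small sketch under one reading of the description above (the absolute standard deviation taken as |firm median − population mean| / population sd, followed by min–max normalisation); the metric names and values are invented:

```r
# Toy per-firm medians for two metrics (invented values)
firm_medians <- data.frame(
  nwp_m     = c(5, 10, 500, 20),
  net_bel_m = c(2, 30, 400, 15),
  row.names = c("Firm A", "Firm B", "Firm C", "Firm D")
)

# Absolute deviation from the population mean, in standard-deviation units
abs_z  <- function(x) abs(x - mean(x)) / sd(x)
# Min-max normalisation so each metric contributes without weighting bias
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

scores     <- apply(firm_medians, 2, function(x) minmax(abs_z(x)))
size_score <- rowMeans(scores)
sort(size_score, decreasing = TRUE)   # Firm C, the extreme on both metrics, ranks top
```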

Changing Business Profiles

The volatility of a firm’s accounts is of considerable concern to supervisors, so it is important to investigate.

I will start by considering variation at the metric level. I am going to use relative standard deviation (RSD), as this allows me to compare standard deviations across variables effectively.
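RSD (also called the coefficient of variation) is simply the standard deviation scaled by the magnitude of the mean, which puts metrics on very different scales onto a comparable footing. A quick sketch with made-up series:

```r
# Relative standard deviation: sd / |mean|
rsd <- function(x) sd(x, na.rm = TRUE) / abs(mean(x, na.rm = TRUE))

stable   <- c(100, 101, 99, 100, 100)   # low relative dispersion
volatile <- c(10, 200, 5, 150, 1)       # high relative dispersion

round(rsd(stable), 3)     # → 0.007
round(rsd(volatile), 3)   # → 1.293
```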

This chart shows that RSD varies across metrics over time and that several metrics have high variance. Six metrics stand out with a high relative standard deviation, all of them ratios: gross_expense_ratio, net_expense_ratio, gross_combined_ratio, net_combined_ratio, pure_gross_claims_ratio, and pure_net_claims_ratio. Although there are six, they form three highly correlated net/gross pairs, and there are also strong relationships by definition: net_combined_ratio is \(\frac{\text{net claims and expenses incurred}}{\text{net earned premium}}\), while net_expense_ratio is just the expenses part of that equation, suggesting this could be expense-driven variance. We also see a massive spike in 2016 in the RSD of gross_expense_ratio, which I suspect is misreporting, especially given there is no corresponding spike in net_expense_ratio; I will explore this later. I will look into the firms that have produced these high variations in the net variables, as these will likely explain the variations in the gross variables too.

We can see from this graph that the net_expense_ratio spike has potentially come from two enormous reported values. I will add a filter to this chart and the other ratio charts to keep only median values between 0 and 0.87 (the upper bound set by \(Q3 + 1.5 \times IQR\)). I will be consistent in flagging and filtering outliers in this way.

After filtering out misreporting firms, we can see the five firms whose data changed most substantially over the period 2016–20. This is important for supervisors to be aware of because such volatility can damage investors’ confidence in a firm as well as its ability to conduct any reasonable long-term financial planning.

As Firm 418 has misreported both net_combined_ratio and net_expense_ratio, we can assume the issue lies in a misreporting of their expenses, which has had a knock-on impact on several of their larger KPIs.

Again, once these are removed, we can see multiple firms with volatile ratios. Firms 287, 291 and 417, having both a volatile and a high net combined ratio, would be a cause for concern for supervisors.

This is calculated as \(\frac{\text{net claims}}{\text{net earned premium}}\), so a firm like Firm 38, which has managed to reduce this ratio from a poor position, is a positive story and something that would allay supervisors’ concern.

Outliers and Misreporting

I’ve shown a couple of techniques I have used to filter out outliers and highlight elements in the time series. I have chosen the interquartile range method due to the presence of extreme outliers in the data; the standard deviation is influenced much more strongly by extreme values. I have created a flagging system: if a data point falls outside \((Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR)\) for the entire metric population it is an ‘IQR Outlier’; if it falls below the 2nd or above the 8th decile it is labelled a ‘Decile Outlier’; otherwise it receives ‘No Flag’. I will provide an example below for Firm 12.
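The flagging rules can be sketched as below; this is an illustration of the logic just described, not the report's exact implementation, and the example vector is invented:

```r
# Flag each value against the population's IQR fences first, then the
# 2nd/8th decile fences, else leave it unflagged
flag_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  dec <- quantile(x, c(0.2, 0.8), na.rm = TRUE)
  ifelse(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, "IQR Outlier",
  ifelse(x < dec[1] | x > dec[2], "Decile Outlier", "No Flag"))
}

x <- c(1, 2, 3, 4, 5, 100)
flag_outliers(x)
# → "Decile Outlier" "No Flag" "No Flag" "No Flag" "No Flag" "IQR Outlier"
```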

We can see there is fair homogeneity in submissions and there are no systemic reporting issues towards any particular metric.

This approach allows users to review flagged submissions quickly: if we take Firm 139’s scr_m and Firm 222’s nwp_m submissions, we can see that they have a range of flags throughout the series.

| firm | year | scr_m | nwp_m |
|---|---|---:|---:|
| Firm 139 | 2016 | 148.44436 | 700.60412 |
| Firm 139 | 2017 | 102.00252 | 678.88901 |
| Firm 139 | 2018 | 384.08633 | 1013.57043 |
| Firm 139 | 2019 | 102.85466 | 208.04965 |
| Firm 139 | 2020 | 137.17073 | 468.09469 |
| Firm 222 | 2016 | 157.07521 | 223.59939 |
| Firm 222 | 2017 | 92.17920 | 423.15529 |
| Firm 222 | 2018 | 286.40581 | 291.80749 |
| Firm 222 | 2019 | 66.80772 | 153.44594 |
| Firm 222 | 2020 | 78.87480 | 79.30072 |

If we graph that we can see this:

So this method allows us to assess each firm’s submission against the rest of the population and decide whether we consider it an outlier. In this case we can safely assume the IQR Outlier will require a resubmission.

For the machine learning calculations, I am going to remove all IQR outliers to help with explainability.

filtered_df = dplyr::anti_join(
  x = df %>%
    filter(value_type == 'time_series_value') %>%
    pivot_longer(cols = -c('firm', 'year', 'value_type')),
  y = outliers_df %>%
    pivot_longer(cols = -c('firm', 'year')) %>%
    filter(value == 'IQR Outlier'),
  # drop every firm-year that contains at least one IQR outlier
  by = c('firm', 'year')
  ) %>%
  # we are going to describe the whole time series, so we just need firm and metric
  dplyr::group_by(firm, name) %>%
  dplyr::summarise(
    # this is the variable I will use (for now)
    median_value = median(value, na.rm = TRUE)
    # mad_value = mad(value, na.rm = TRUE)
  ) %>% 
  tidyr::pivot_wider(
    names_from = name, 
    values_from = median_value)

kable(head(filtered_df)) %>%
  kable_styling("striped", full_width = F)
| firm | eof_for_scr_m | equity_m | gross_bel_m | gross_claims_incurred_m | gross_combined_ratio | gross_expense_ratio | gwp_m | net_bel_m | net_combined_ratio | net_expense_ratio | nwp_m | pure_gross_claims_ratio | pure_net_claims_ratio | scr_coverage_ratio | scr_m | total_assets_m | total_liabilities_m |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Firm 1 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.000000 | 0.000000 | 0.00000 | 0.0000000 |
| Firm 101 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.000000 | 0.000000 | 0.00000 | 0.0000000 |
| Firm 102 | 0.00000 | 0.000000 | -22.240795 | 157.288903 | 1.428893 | 0.5209757 | 0 | -14.5213022 | 1.0620728 | 0.5463933 | 0 | 0.6224033 | 0.3246686 | 0.000000 | 0.000000 | 0.00000 | 0.0000000 |
| Firm 103 | 18.47424 | 47.916448 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 2.595551 | 1.123171 | 64.46984 | 0.6099226 |
| Firm 106 | 0.00000 | 0.000000 | 7.185743 | 1.579704 | 0.000000 | 0.0000000 | 0 | 0.1947479 | 0.0000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.000000 | 0.000000 | 0.00000 | 0.0000000 |
| Firm 108 | 12.98278 | 4.660648 | 2.386303 | 41.942028 | 0.338826 | 0.0501317 | 0 | 3.3149043 | 0.6461881 | 0.0183560 | 0 | 0.3933908 | 0.9932665 | 3.596400 | 1.049790 | 12.44512 | 16.7744278 |

Machine Learning Application

One of the strengths of the PRA’s supervisory infrastructure is that we are able to categorise firms based on their size, business model, area of operation, etc. Each of these characteristics requires its own supervisory approach, which the PRA can adapt to, enabling more effective regulation. For my machine learning application I will cluster the firms into distinct categories, using principal component analysis (PCA) to prepare the data, so that we can better define them and our approach to them. I have chosen this cleaning step before clustering because, as a dimension-reduction process, PCA extracts the most important elements of each feature and reduces the dataset to a more comprehensible size. Given the multicollinearity problems shown at the start, this approach also mitigates that issue and improves cluster quality. With tidymodels we can easily create a pipeline from input to PCA output.

pca_recipe <- recipes::recipe(firm ~ ., data = filtered_df)  %>%
  # drop any column that contains missing values
  recipes::step_filter_missing(
    all_numeric(),
    threshold = 0) %>%
  # centre and scale all numeric columns
  recipes::step_normalize(all_numeric()) %>%
  # compute enough principal components to explain 95% of the variance
  recipes::step_pca(
    all_numeric(), 
    threshold = .95
    )

pca_prep = recipes::prep(x = pca_recipe, training = filtered_df)

# plot this and cluster
pca_output_df = recipes::bake(object = pca_prep, filtered_df)

kable(head(pca_output_df)) %>%
  kable_styling("striped", full_width = F)
| firm | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Firm 1 | -1.617780 | 0.0828767 | -0.0006132 | -0.0815506 | 0.0299363 | 0.0090921 | -0.0247785 | 0.0283331 |
| Firm 101 | -1.617780 | 0.0828767 | -0.0006132 | -0.0815506 | 0.0299363 | 0.0090921 | -0.0247785 | 0.0283331 |
| Firm 102 | 4.860253 | -2.8471428 | 1.3587408 | -3.5633653 | 0.1309897 | 0.0477675 | 0.5332307 | 0.4841562 |
| Firm 103 | -1.617780 | 0.0828767 | -0.0006132 | -0.0815506 | 0.0299363 | 0.0090921 | -0.0247785 | 0.0283331 |
| Firm 106 | -1.566940 | 0.1563090 | 0.0361121 | -0.1119333 | -0.0219634 | 0.0247992 | 0.0113430 | 0.0330784 |
| Firm 108 | 1.609362 | -0.6238122 | -2.2371248 | -1.0468584 | 0.5804565 | 1.0327301 | -0.5352348 | 1.0128553 |
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Standard deviation | 2.345756 | 0.9705261 | 0.7830859 | 0.6875206 | 0.6415613 | 0.5592552 | 0.5286541 | 0.4974662 | 0.46721 |
| Proportion of Variance | 0.611400 | 0.1046600 | 0.0681400 | 0.0525200 | 0.0457300 | 0.0347500 | 0.0310500 | 0.0275000 | 0.02425 |
| Cumulative Proportion | 0.611400 | 0.7160500 | 0.7841900 | 0.8367100 | 0.8824400 | 0.9172000 | 0.9482500 | 0.9757500 | 1.00000 |

This provides us with eight principal components that together explain over 95% of the total variance of the dataset (the cumulative proportion first exceeds 0.95 at PC8).
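The threshold rule that recipes applies internally can be reproduced with base prcomp on toy data: keep the smallest number of components whose cumulative variance share reaches 95% (the matrix here is simulated, not the firm data).

```r
# Simulated data with one deliberately correlated pair of columns
set.seed(42)
toy_mat <- matrix(rnorm(100 * 5), ncol = 5)
toy_mat[, 2] <- toy_mat[, 1] + rnorm(100, sd = 0.1)

pca      <- prcomp(toy_mat, scale. = TRUE)
# cumulative share of variance explained by the first k components
cum_prop <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
# smallest k whose cumulative share reaches the 95% threshold
n_pcs    <- which(cum_prop >= 0.95)[1]
n_pcs
```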

I am now going to plot PC1 against PC2 to see if there are obvious clusters before performing k-means clustering.

There are few discernible groups using just PCs 1 and 2, even with ggplot2’s geom_bin2d function to highlight density. I will therefore use k-means clustering, an unsupervised ML technique that groups points together by minimising within-cluster variance; it is an efficient and highly explainable algorithm, and the latter point is important when sharing findings with non-technical colleagues.
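A minimal k-means sketch on toy 2-D scores (the data and the choice of k = 3 are assumptions for illustration; in practice k would be chosen with, e.g., an elbow plot):

```r
# Three well-separated groups of 20 points each, standing in for PC1/PC2 scores
set.seed(1)
pts <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 6),  ncol = 2),
  matrix(rnorm(40, mean = -6), ncol = 2)
)

# nstart = 25 reruns the algorithm from random starts and keeps the best fit
km <- kmeans(pts, centers = 3, nstart = 25)
table(km$cluster)   # each cluster recovers one group of 20 points
```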